Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition
Self-supervised learning (SSL) based speech pre-training has attracted much
attention for its capability of extracting rich representations learned from
massive unlabeled data. On the other hand, the use of weakly-supervised data is
less explored for speech pre-training. To fill this gap, we propose a
weakly-supervised speech pre-training method based on speaker-aware speech
data. It adopts a similar training procedure to the widely-used masked speech
prediction based SSL framework, while incorporating additional target-speaker
enrollment information as an auxiliary input. In this way, the learned
representation is steered towards the target speaker even in the presence of
highly overlapping interference, allowing potential applications to tasks such
as target speech recognition. Our experiments on Libri2Mix and WSJ0-2mix
datasets show that the proposed model achieves significantly better ASR
performance than WavLM, the state-of-the-art SSL model with denoising
capability.
Comment: Accepted by Interspeech; 5 pages, 1 figure, 3 tables
Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation
Neural speech separation has made remarkable progress and its integration
with automatic speech recognition (ASR) is an important direction towards
realizing multi-speaker ASR. This work provides an insightful investigation of
speech separation in reverberant and noisy-reverberant scenarios as an ASR
front-end. In detail, we explore multi-channel separation methods, mask-based
beamforming and complex spectral mapping, as well as the best features to use
in the ASR back-end model. We employ the recent self-supervised learning
representation (SSLR) as a feature and improve recognition performance over
filterbank features. To further improve multi-speaker recognition
performance, we present a carefully designed training strategy for integrating
speech separation and recognition with SSLR. The proposed integration using
TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5%
word error rate on the reverberant WHAMR! test set, significantly outperforming an
existing mask-based MVDR beamforming and filterbank integration (28.9%).
Comment: Accepted to IEEE WASPAA 202
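For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a word error rate (WER) is commonly computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. It is not the scoring script used in the paper; the function names `word_error_rate` and `relative_reduction` are hypothetical, and real evaluations usually apply text normalization that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (excluding the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer
```

Under this definition, `relative_reduction(28.9, 2.5)` expresses the gap between the two reported systems as a fraction of the baseline error, which is often easier to compare across test sets than the absolute difference.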
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Pre-training speech models on large volumes of data has achieved remarkable
success. OpenAI Whisper is a multilingual multitask model trained on 680k hours
of supervised speech data. It generalizes well to various speech recognition
and translation benchmarks even in a zero-shot setup. However, the full
pipeline for developing such models (from data collection to training) is not
publicly accessible, which makes it difficult for researchers to further
improve its performance and address training-related issues such as efficiency,
robustness, fairness, and bias. This work presents an Open Whisper-style Speech
Model (OWSM), which reproduces Whisper-style training using an open-source
toolkit and publicly available data. OWSM even supports more translation
directions and can be more efficient to train. We will publicly release all
scripts used for data preparation, training, inference, and scoring as well as
pre-trained models and training logs to promote open science.
Comment: Accepted at ASRU 202
A Comparative Study on Transformer vs RNN in Speech Applications
Sequence-to-sequence models have been widely used in end-to-end speech
processing, for example, automatic speech recognition (ASR), speech translation
(ST), and text-to-speech (TTS). This paper focuses on an emergent
sequence-to-sequence model called Transformer, which achieves state-of-the-art
performance in neural machine translation and other natural language processing
applications. We undertook intensive studies in which we experimentally
compared and analyzed Transformer and conventional recurrent neural networks
(RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS
benchmarks. Our experiments revealed various training tips and significant
performance benefits obtained with Transformer for each task including the
surprising superiority of Transformer in 13/15 ASR benchmarks in comparison
with RNN. We are preparing to release Kaldi-style reproducible recipes using
open-source and publicly available datasets for all the ASR, ST, and TTS tasks
so that the community can build on our exciting outcomes.
Comment: Accepted at ASRU 201